# Vision-Language Interaction
## Qwen2.5 VL 7B Instruct Q8 0 GGUF
- **License:** Apache-2.0
- **Description:** A GGUF-format conversion of Qwen2.5-VL-7B-Instruct, supporting multimodal image-and-text interaction tasks.
- **Tags:** Text-to-Image, English
- **Author:** cxtb · **Downloads:** 72 · **Likes:** 1
## Magma 8B
- **License:** MIT
- **Description:** Magma is a foundation multimodal AI agent model that processes image and text inputs to generate text outputs, with complex interaction capabilities in both virtual and real-world environments.
- **Tags:** Image-to-Text, Transformers
- **Author:** microsoft · **Downloads:** 4,526 · **Likes:** 363
## Qwen2.5 VL 3B Instruct MLX 8bits
- **Description:** An 8-bit quantized version of the Qwen2.5-VL-3B-Instruct model, optimized for the MLX framework and supporting image-to-text generation tasks.
- **Tags:** Image-to-Text, Transformers, English
- **Author:** moot20 · **Downloads:** 27 · **Likes:** 1
## AURORA
- **License:** MIT
- **Description:** AURORA is an action- and reasoning-centric image editing model trained on video and simulation data, focused on vision-language tasks.
- **Tags:** Image Generation, English
- **Author:** McGill-NLP · **Downloads:** 81 · **Likes:** 4
## Llava Meta Llama 3 8B Instruct
- **Description:** A multimodal model integrating Meta-Llama-3-8B-Instruct and LLaVA-v1.5, providing advanced vision-language understanding capabilities.
- **Tags:** Image-to-Text, Transformers
- **Author:** MBZUAI · **Downloads:** 20 · **Likes:** 11
## Internlm Xcomposer2 Vl 7b
- **License:** Other
- **Description:** InternLM-XComposer2 is a vision-language large model built on InternLM2, featuring outstanding image-text understanding and creation capabilities.
- **Tags:** Text-to-Image, Transformers
- **Author:** internlm · **Downloads:** 1,902 · **Likes:** 82
## Instructblip Vicuna 7b 8bit
- **Description:** InstructBLIP-Vicuna-7B is a vision-language model based on Vicuna-7B, supporting image-to-text conversion tasks.
- **Tags:** Image-to-Text, Transformers
- **Author:** Mediocreatmybest · **Downloads:** 24 · **Likes:** 3